Machine Learning Foundations & Techniques

Machine learning is the study of how computers can adaptively improve their performance with experience accumulated from observed data. These two sister courses teach the most fundamental algorithmic, theoretical, and practical tools that any user of machine learning needs to know. The first course focuses more on mathematical tools, and the second focuses more on algorithmic tools.

Machine Learning Foundations

Topic 1: When can machines learn?

Lecture 1: The Learning Problem

A takes D and H to get g

  • Course Introduction

    foundation oriented and story-like

  • What is Machine Learning

    use data to approximate target

  • Applications of Machine Learning

    almost everywhere

  • Components of Machine Learning

    A takes D and H to get g

  • Machine Learning and Other Fields

    related to Data Mining, Artificial Intelligence, and Statistics

Lecture 2: Learning to Answer Yes/No

PLA A takes linearly separable D and perceptrons H to get hypothesis g (see the sketch after the bullets)

  • Perceptron Hypothesis Set

    hyperplanes/linear classifiers in Rd

  • Perceptron Learning Algorithm (PLA)

    correct mistakes and improve iteratively

  • Guarantee of PLA

    no mistakes eventually if linearly separable

  • Non-Separable Data

    hold somewhat ‘best’ weights in pocket
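
The pocket idea above keeps the best weights seen so far while PLA keeps correcting mistakes. A minimal NumPy sketch, assuming `X` already carries a leading 1 per row for the bias term and `y` holds ±1 labels (all names illustrative):

```python
import numpy as np

def pla_pocket(X, y, max_updates=1000):
    """PLA with a 'pocket': keep the best weights seen so far.

    X: (N, d+1) array with a leading 1 per row for the bias term.
    y: (N,) array of +/-1 labels.
    """
    w = np.zeros(X.shape[1])            # current weights
    pocket = w.copy()                   # best weights so far
    best_errors = np.inf

    for _ in range(max_updates):
        preds = np.sign(X @ w)
        mistakes = np.where(preds != y)[0]
        if len(mistakes) == 0:          # linearly separable: PLA halts
            return w
        i = np.random.choice(mistakes)  # pick one mistake
        w = w + y[i] * X[i]             # correct it: w <- w + y_n x_n
        errors = np.sum(np.sign(X @ w) != y)
        if errors < best_errors:        # pocket the improvement
            best_errors, pocket = errors, w.copy()
    return pocket
```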

Lecture 3: Types of Learning

focus: binary classification or regression from a batch of supervised data with concrete features

  • Learning with Different Output Space Y

    [classification], [regression], structured

  • Learning with Different Data Label yn

    [supervised], un/semi-supervised, reinforcement

  • Learning with Different Protocol f ⇒ (xn, yn)

    [batch], online, active

  • Learning with Different Input Space X

    [concrete], raw, abstract

Lecture 4: Feasibility of Learning

learning is PAC-possible if enough statistical data and finite |H|

  • Learning is Impossible?

    absolutely no free lunch outside D

  • Probability to the Rescue

    probably approximately correct outside D

  • Connection to Learning

    verification possible if Ein(h) small for fixed h

  • Connection to Real Learning

    learning possible if |H| finite and Ein(g) small
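
In symbols, the two guarantees the bullets refer to: Hoeffding for a single fixed h (verification), and the union bound over a finite hypothesis set with M = |H| (learning):

```latex
% single fixed hypothesis h (verification):
\Pr\big[\,|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon\,\big] \;\le\; 2\exp\!\big(-2\epsilon^{2}N\big)

% finite hypothesis set, M = |\mathcal{H}| (learning):
\Pr\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big] \;\le\; 2M\exp\!\big(-2\epsilon^{2}N\big)
```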

Topic 2: Why Can Machines Learn?

Lecture 5: Training versus Testing

effective price of choice in training: (wishfully) growth function mH(N) with a break point

  • Recap and Preview

    two questions: Eout(g) ≈ Ein(g), and Ein(g) ≈ 0

  • Effective Number of Lines

    at most 14 through the eye of 4 inputs

  • Effective Number of Hypotheses

    at most mH(N) through the eye of N inputs

  • Break Point

    when mH(N) becomes ‘non-exponential’

Lecture 6: Theory of Generalization

Eout ≈ Ein possible if mH(N) breaks somewhere and N large enough

  • Restriction of Break Point

    break point ‘breaks’ consequent points

  • Bounding Function: Basic Cases

    B(N,k) bounds mH(N) with break point k

  • Bounding Function: Inductive Cases

    B(N,k) is poly(N)

  • A Pictorial Proof

    mH(N) can replace M with a few changes
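
The two facts the proof chain rests on, with k the break point of H: the recursive bound on the bounding function, and the resulting polynomial bound on the growth function:

```latex
B(N,k) \;\le\; B(N-1,k) + B(N-1,k-1),
\qquad
m_{\mathcal{H}}(N) \;\le\; B(N,k) \;\le\; \sum_{i=0}^{k-1} \binom{N}{i}
\;=\; \mathrm{poly}(N)\ \text{of degree } k-1 .
```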

Lecture 7: The VC Dimension

learning happens if finite dVC, large N, and low Ein

  • Definition of VC Dimension

    maximum non-break point

  • VC Dimension of Perceptrons

    dVC(H) = d + 1

  • Physical Intuition of VC Dimension

    dVC ≈ #free parameters

  • Interpreting VC Dimension

    loosely: model complexity & sample complexity
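
Quantitatively, the VC bound reads as follows: with probability at least 1 − δ (and using m_H(N) ≤ N^dVC + 1, which holds for N ≥ 2 and dVC ≥ 2), the model-complexity penalty Ω grows with dVC and shrinks with N:

```latex
E_{\mathrm{out}}(g) \;\le\; E_{\mathrm{in}}(g) \;+\;
\underbrace{\sqrt{\frac{8}{N}\,\ln\!\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}}_{\Omega(N,\mathcal{H},\delta)\;=\;\text{model complexity}},
\qquad
m_{\mathcal{H}}(N) \;\le\; N^{d_{\mathrm{VC}}} + 1 .
```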

Lecture 8: Noise and Error

learning can happen with target distribution P(y|x) and low Ein w.r.t. err

  • Noise and Probabilistic Target

    can replace f(x) by P(y|x)

  • Error Measure

    affect ‘ideal’ target

  • Algorithmic Error Measure

    user-dependent ⇒ plausible or friendly

  • Weighted Classification

    easily done by virtual ‘example copying’

Topic 3: How Can Machines Learn?

Lecture 9: Linear Regression

analytic solution wLIN = X†y with linear regression hypotheses and squared error

  • Linear Regression Problem

    use hyperplanes to approximate real values

  • Linear Regression Algorithm

    analytic solution with pseudo-inverse

  • Generalization Issue

    Eout − Ein ≈ 2(d+1)/N on average

  • Linear Regression for Binary Classification

    0/1 error ≤ squared error
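
A minimal NumPy sketch of the analytic solution wLIN = X†y via the pseudo-inverse (variable names illustrative; X is assumed to carry a leading 1 per row):

```python
import numpy as np

def linear_regression(X, y):
    """Return w_LIN = pinv(X) @ y for squared-error linear regression.

    X: (N, d+1) with a leading 1 per row; y: (N,) real targets.
    """
    return np.linalg.pinv(X) @ y

# usage: w = linear_regression(X, y); Ein = np.mean((X @ w - y) ** 2)
```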

Lecture 10: Logistic Regression

gradient descent on cross-entropy error to get good logistic hypothesis

  • Logistic Regression Problem

    P(+1|x) as target and θ(wT x) as hypotheses

  • Logistic Regression Error

    cross-entropy (negative log likelihood)

  • Gradient of Logistic Regression Error

    θ-weighted sum of data vectors

  • Gradient Descent

    roll downhill by −∇Ein(w)
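
A minimal NumPy sketch of gradient descent on the cross-entropy error, using the gradient form noted above (θ-weighted sum of data vectors); names and the fixed step size are illustrative:

```python
import numpy as np

def logistic_gd(X, y, eta=0.1, T=1000):
    """Gradient descent on the cross-entropy error.

    X: (N, d+1) with a leading 1 per row; y: (N,) labels in {-1, +1}.
    Gradient: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n).
    """
    theta = lambda s: 1.0 / (1.0 + np.exp(-s))   # logistic function
    w = np.zeros(X.shape[1])
    for _ in range(T):
        grad = np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
        w -= eta * grad                           # roll downhill by -grad
    return w
```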

Lecture 11: Linear Models for Classification

binary classification via (logistic) regression; multiclass via OVA/OVO decomposition

  • Linear Models for Binary Classification

    three models useful in different ways

  • Stochastic Gradient Descent

    follow negative stochastic gradient

  • Multiclass via Logistic Regression

    predict with maximum estimated P(k|x)

  • Multiclass via Binary Classification

    predict the tournament champion

Lecture 12: Nonlinear Transformation

nonlinear hypotheses via nonlinear feature transform Φ plus linear models in Z, with the price of model complexity (see the transform sketch after the bullets)

  • Quadratic Hypotheses

    linear hypotheses on quadratic-transformed data

  • Nonlinear Transform

    happy linear modeling after Z = Φ(X)

  • Price of Nonlinear Transform

    computation/storage/[model complexity]

  • Structured Hypothesis Sets

    linear/simpler model first
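
A minimal sketch of a second-order feature transform Φ2; any linear model can then be run on Z = Φ2(X) at the price of a larger d̃ (names illustrative):

```python
import numpy as np

def phi2(x):
    """Second-order polynomial transform of one d-dimensional point x:
    constant term, linear terms, and all degree-2 cross terms."""
    x = np.asarray(x, dtype=float)
    quad = [x[i] * x[j] for i in range(len(x)) for j in range(i, len(x))]
    return np.concatenate(([1.0], x, quad))

# usage: Z = np.array([phi2(x) for x in X_raw]); then run any linear model on Z
```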

Topic 4: How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting

overfitting happens with excessive power, stochastic/deterministic noise, and limited data

  • What is Overfitting?

    lower Ein but higher Eout

  • The Role of Noise and Data Size

    overfitting ‘easily’ happens!

  • Deterministic Noise

    what H cannot capture acts like noise

  • Dealing with Overfitting

    data cleaning/pruning/hinting, and more

Lecture 14: Regularization

minimizes augmented error, where the added regularizer effectively limits model complexity

  • Regularized Hypothesis Set

    original H + constraint

  • Weight Decay Regularization

    add (λ/N)·wTw to Eaug

  • Regularization and VC Theory

    regularization decreases dEFF

  • General Regularizers

    target-dependent, [plausible], or [friendly]
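
For the squared-error case, weight decay has the closed-form ridge-regression solution; a minimal NumPy sketch (names illustrative):

```python
import numpy as np

def ridge_regression(X, y, lam):
    """w_REG = argmin_w (1/N)||Xw - y||^2 + (lam/N) w^T w
             = (X^T X + lam * I)^(-1) X^T y
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```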

Lecture 15: Validation

(crossly) reserve validation data to simulate testing procedure for model selection

  • Model Selection Problem

    dangerous by Ein and dishonest by Etest

  • Validation

    select with Eval(Am(Dtrain)) while returning Am∗(D)

  • Leave-One-Out Cross Validation

    huge computation for almost unbiased estimate

  • V-Fold Cross Validation

    reasonable computation and performance
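
A minimal sketch of V-fold cross validation on NumPy arrays; `train_fn` and `error_fn` are assumed user-supplied callables (train on V−1 folds, validate on the held-out fold, average):

```python
import numpy as np

def v_fold_cv_error(X, y, train_fn, error_fn, V=10):
    """Estimate Ecv: split D into V folds, train on V-1 folds,
    validate on the held-out fold, and average the validation errors."""
    N = len(y)
    idx = np.random.permutation(N)
    folds = np.array_split(idx, V)
    errors = []
    for v in range(V):
        val = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        g_minus = train_fn(X[train], y[train])        # model trained without fold v
        errors.append(error_fn(g_minus, X[val], y[val]))
    return np.mean(errors)
```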

Lecture 16: Three Learning Principles

  • Occam’s Razor

    simple, simple, simple!

  • Sampling Bias

    match test scenario as much as possible

  • Data Snooping

    any use of data is ‘contamination’

  • Power of Three

    relatives, bounds, models, tools, principles

Machine Learning Techniques

Topic 1: Embedding Numerous Features: Kernel Models

Lecture 1: Linear Support Vector Machine

linear SVM: more robust and solvable with quadratic programming

  • Course Introduction

    from foundations to techniques

  • Large-Margin Separating Hyperplane

    intuitively more robust against noise

  • Standard Large-Margin Problem

    minimize ‘length of w’ at special separating scale

  • Support Vector Machine

    ‘easy’ via quadratic programming

  • Reasons behind Large-Margin Hyperplane

    fewer dichotomies and better generalization
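
In symbols, the standard large-margin (hard-margin) problem the bullets describe: a convex QP in (b, w), with the separating scale fixed so that the margin equals 1/‖w‖:

```latex
\min_{b,\mathbf{w}} \;\; \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w}
\qquad \text{s.t.} \qquad
y_{n}\big(\mathbf{w}^{\mathsf T}\mathbf{x}_{n} + b\big) \;\ge\; 1,
\quad n = 1,\ldots,N .
```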

Lecture 2: Dual Support Vector Machine

dual SVM: another QP with valuable geometric messages and almost no dependence on d̃

  • Motivation of Dual SVM

    want to remove dependence on d̃

  • Lagrange Dual SVM

    KKT conditions link primal/dual

  • Solving Dual SVM

    another QP, better solved with special solver

  • Messages behind Dual SVM

    SVs represent fattest hyperplane

Lecture 3: Kernel Support Vector Machine

kernel as a shortcut to (transform + inner product) that removes the dependence on d̃: allowing a spectrum from simple (linear) models to infinite-dimensional (Gaussian) ones with margin control

  • Kernel Trick

    kernel as shortcut of transform + inner product

  • Polynomial Kernel

    embeds specially-scaled polynomial transform

  • Gaussian Kernel

    embeds infinite dimensional transform

  • Comparison of Kernels

    linear for efficiency or Gaussian for power
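
A minimal NumPy sketch of the Gaussian (RBF) kernel matrix, computed directly as a shortcut without forming the infinite-dimensional transform (names and the default γ are illustrative):

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x1_i - x2_j||^2), the Gaussian (RBF) kernel."""
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)              # pairwise squared distances
    return np.exp(-gamma * sq)
```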

Lecture 4: Soft-Margin Support Vector Machine

allow some margin violations ξn while penalizing them by C; equivalent to upper-bounding αn by C

  • Motivation and Primal Problem

    add margin violations ξn

  • Dual Problem

    upper-bound αn by C

  • Messages behind Soft-Margin SVM

    bounded/free SVs for data analysis

  • Model Selection

    cross-validation, or approximately nSV
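
The soft-margin primal and the key dual consequence, matching the bullets above (zn denotes the possibly transformed input):

```latex
\min_{b,\mathbf{w},\boldsymbol{\xi}} \;\;
\tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w} + C\sum_{n=1}^{N}\xi_{n}
\qquad \text{s.t.} \qquad
y_{n}\big(\mathbf{w}^{\mathsf T}\mathbf{z}_{n} + b\big) \ge 1 - \xi_{n},
\;\; \xi_{n} \ge 0 ;
\qquad \text{dual: } 0 \le \alpha_{n} \le C .
```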

Lecture 5: Kernel Logistic Regression

two-level learning for SVM-like sparse model for soft classification, or using representer theorem with regularized logistic error for dense model

  • Soft-Margin SVM as Regularized Model

    L2-regularization with hinge error measure

  • SVM versus Logistic Regression

    ≈ L2-regularized logistic regression

  • SVM for Soft Binary Classification

    common approach: two-level learning

  • Kernel Logistic Regression

    representer theorem on L2-regularized LogReg

Lecture 6: Support Vector Regression

kernel ridge regression (dense) via ridge regression + representer theorem; support vector regression (sparse) via regularized tube error + Lagrange dual

  • Kernel Ridge Regression

    representer theorem on ridge regression

  • Support Vector Regression Primal

    minimize regularized tube errors

  • Support Vector Regression Dual

    a QP similar to SVM dual

  • Summary of Kernel Models

    with great power comes great responsibility

Topic 2: Combining Predictive Features: Aggregation Models

Lecture 7: Blending and Bagging

blending known diverse hypotheses uniformly, linearly, or even non-linearly; obtaining diverse hypotheses from bootstrapped data

  • Motivation of Aggregation

    aggregated G strong and/or moderate

  • Uniform Blending

    diverse hypotheses, ‘one vote, one value’

  • Linear and Any Blending

    two-level learning with hypotheses as transform

  • Bagging (Bootstrap Aggregation)

    bootstrapping for diverse hypotheses
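
A minimal sketch of bagging for ±1 classification; `base_fn` is an assumed user-supplied learner that returns a predict function (train on bootstrapped copies of the data, then take a uniform vote):

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, base_fn, T=25):
    """Bootstrap AGGregation: run base_fn on T bootstrapped copies of the
    data and combine the +/-1 predictions by a uniform vote."""
    N = len(y_train)
    votes = np.zeros(len(X_test))
    for _ in range(T):
        idx = np.random.randint(0, N, size=N)   # sample N points with replacement
        g = base_fn(X_train[idx], y_train[idx])
        votes += np.sign(g(X_test))
    return np.sign(votes)                        # 'one vote, one value'
```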

Lecture 8: Adaptive Boosting

optimal re-weighting for diverse hypotheses and adaptive linear aggregation to boost ‘weak’ algorithms

  • Motivation of Boosting

    aggregate weak hypotheses for strength

  • Diversity by Re-weighting

    scale up incorrect, scale down correct

  • Adaptive Boosting Algorithm

    two heads are better than one, theoretically

  • Adaptive Boosting in Action

    AdaBoost-Stump useful and efficient
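
The re-weighting and aggregation rules behind 'scale up incorrect, scale down correct', with u the example weights, εt the weighted error of weak hypothesis gt, and [[·]] the Boolean indicator:

```latex
\epsilon_{t} = \frac{\sum_{n} u_{n}^{(t)}\,[\![\,y_{n} \ne g_{t}(\mathbf{x}_{n})\,]\!]}{\sum_{n} u_{n}^{(t)}},
\qquad
\blacklozenge_{t} = \sqrt{\frac{1-\epsilon_{t}}{\epsilon_{t}}},
\qquad
u_{n}^{(t+1)} =
\begin{cases}
u_{n}^{(t)}\cdot \blacklozenge_{t} & \text{if } g_{t}(\mathbf{x}_{n}) \ne y_{n}\\[2pt]
u_{n}^{(t)} / \blacklozenge_{t} & \text{if } g_{t}(\mathbf{x}_{n}) = y_{n}
\end{cases},
\qquad
\alpha_{t} = \ln \blacklozenge_{t},
\qquad
G(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{t=1}^{T}\alpha_{t}\,g_{t}(\mathbf{x})\Big).
```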

Lecture 9: Decision Tree

recursive branching (purification) for conditional aggregation of constant hypotheses

  • Decision Tree Hypothesis

    express path-conditional aggregation

  • Decision Tree Algorithm

    recursive branching until termination to base

  • Decision Tree Heuristics in C&RT

    pruning, categorical branching, surrogate

  • Decision Tree in Action

    explainable and efficient

Lecture 10: Random Forest

bagging of randomized C&RT trees with automatic validation and feature selection

  • Random Forest Algorithm

    bag of trees on randomly projected subspaces

  • Out-Of-Bag Estimate

    self-validation with OOB examples

  • Feature Selection

    permutation test for feature importance

  • Random Forest in Action

    ‘smooth’ boundary with many trees

Lecture 11: Gradient Boosted Decision Tree

aggregating trees from functional gradient and steepest descent subject to any error measure

  • Adaptive Boosted Decision Tree

    sampling and pruning for ‘weak’ trees

  • Optimization View of AdaBoost

    functional gradient descent on exponential error

  • Gradient Boosting

    iterative steepest residual fitting

  • Summary of Aggregation Models

    some cure underfitting; some cure overfitting

Topic 3: Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network

automatic pattern feature extraction from layers of neurons with backprop for GD/SGD

  • Motivation

    multi-layer for power with biological inspirations

  • Neural Network Hypothesis

    layered pattern extraction until linear hypothesis

  • Neural Network Learning

    backprop to compute gradient efficiently

  • Optimization and Regularization

    tricks on initialization, regularizer, early stopping

Lecture 13: Deep Learning

pre-training with denoising autoencoder (non-linear PCA) and fine-tuning with backprop for NNet with many layers

  • Deep Neural Network

    difficult hierarchical feature extraction problem

  • Autoencoder

    unsupervised NNet learning of representation

  • Denoising Autoencoder

    using noise as hints for regularization

  • Principal Component Analysis

    linear autoencoder variant for data processing

Lecture 14: Radial Basis Function Network

linear aggregation of distance-based similarities using k-Means clustering for prototype finding

  • RBF Network Hypothesis

    prototypes instead of neurons as transform

  • RBF Network Learning

    linear aggregation of prototype ‘hypotheses’

  • k-Means Algorithm

    clustering with alternating optimization

  • k-Means and RBF Network in Action

    proper choice of # prototypes important
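
A minimal NumPy sketch of the alternating optimization in k-Means, which the RBF Network uses to find prototypes (names and the simple initialization are illustrative):

```python
import numpy as np

def k_means(X, k, iters=100):
    """Alternating optimization: fix centers -> assign each point to its
    nearest center; fix assignments -> move each center to the mean of
    its assigned points."""
    centers = X[np.random.choice(len(X), k, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        for m in range(k):
            if np.any(assign == m):
                centers[m] = X[assign == m].mean(axis=0)
    return centers, assign
```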

Lecture 15: Matrix Factorization

linear models of movies on extracted user features (or vice versa) jointly optimized with stochastic gradient descent

  • Linear Network Hypothesis

    feature extraction from binary vector encoding

  • Basic Matrix Factorization

    alternating least squares between user/movie

  • Stochastic Gradient Descent

    efficient and easily modified for practical use

  • Summary of Extraction Models

    powerful thus need careful use
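
For the SGD variant above, one per-rating update step in symbols: pick an observed rating rnm at random, compute the residual from the current user factor vn and movie factor wm, and nudge both (η is the learning rate):

```latex
\tilde{r} \;=\; r_{nm} - \mathbf{w}_{m}^{\mathsf T}\mathbf{v}_{n},
\qquad
\mathbf{v}_{n} \;\leftarrow\; \mathbf{v}_{n} + \eta\,\tilde{r}\,\mathbf{w}_{m},
\qquad
\mathbf{w}_{m} \;\leftarrow\; \mathbf{w}_{m} + \eta\,\tilde{r}\,\mathbf{v}_{n}.
```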

Lecture 16: Finale

  • Feature Exploitation Techniques

    kernel, aggregation, extraction, low-dimensional

  • Error Optimization Techniques

    gradient, equivalence, stages

  • Overfitting Elimination Techniques

    (lots of) regularization, validation

  • Machine Learning in Practice

    welcome to the jungle

Notes

  • Terms

    • regularization / ridge regression
    • pseudo-inverse matrix
    • discrete function
    • Lagrange duality
    • Lagrange minimization
    • Lagrange multipliers
    • feature-space transform
    • primal SVM
    • dual SVM
    • Decision Stump
  • L1-5 25

  • L2-19

  • L4-18 19

  • L5-4 5 9 15 17

  • L7: From the two examples above, we see two advantages of aggregation: feature transform and regularization. As introduced in the Machine Learning Foundations course, feature transform and regularization are opposing forces, compared there to pressing the gas pedal and the brake. If a single model performs feature transform, regularization usually works poorly, and vice versa; a single model typically leans toward one of the two and must trade them off. Aggregation, however, can combine the strengths of both, like controlling the gas and the brake well at the same time, and thus yields a good predictive model.

  • L07-20 bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A

  • L07-21: Note that bagging performs well only when the base algorithm is fairly sensitive to the distribution of the data samples.

  • L08-20

  • L09-10

  • L10-2: Bagging can reduce variance, while a decision tree can increase variance.

  • L11-10

  • L12-6: In the perceptron model, each node's inputs correspond to a neuron's dendrites, and each node's output corresponds to the neuron's axon.

  • L12-13 16GD 21

  • L14-5 9 10